Biostatistics For Dummies (Monika Wahi John Pezzullo)

Table 17-1 shows theoretical coding for a data set containing the variables StudyID (for participant

ID) and PrimaryDx (for participant primary diagnosis). As shown in Table 17-1, you take each level

and make an indicator variable for it: Hypertension is HTN, diabetes is Diab, cancer is Cancer, and

other is OtherDx. Instead of including the variable PrimaryDx in the model, you’d include the

indicator variables for all levels of PrimaryDx except the reference level. So, if the reference level

you selected for PrimaryDx was hypertension, you’d include Diab, Cancer, and OtherDx in the

regression, but would not include HTN. In contrast to the education example, in the set of

variables in Table 17-1, participants can have a 1 for one or more indicator variables or just be in the

reference group. However, with the education example, they can only be coded at one level, or be in

the reference group.

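Don’t forget to leave the reference-level indicator variable out of the regression. If every level

gets an indicator, the indicators are redundant with the intercept (perfect collinearity), and the

software can’t estimate a unique coefficient for each one!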
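
The indicator coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five records are made up, each participant is assumed to have exactly one primary diagnosis, and the helper names add_indicators and model_predictors are hypothetical, not taken from any statistical package.

```python
# Hypothetical records: these StudyID and PrimaryDx values are made up for illustration.
records = [
    {"StudyID": 1, "PrimaryDx": "hypertension"},
    {"StudyID": 2, "PrimaryDx": "diabetes"},
    {"StudyID": 3, "PrimaryDx": "cancer"},
    {"StudyID": 4, "PrimaryDx": "other"},
    {"StudyID": 5, "PrimaryDx": "hypertension"},
]

# Map each level of PrimaryDx to the indicator name used in the text.
INDICATOR_NAMES = {
    "hypertension": "HTN",
    "diabetes": "Diab",
    "cancer": "Cancer",
    "other": "OtherDx",
}


def add_indicators(record):
    """Return a copy of the record with one 0/1 indicator per PrimaryDx level."""
    coded = dict(record)
    for level, name in INDICATOR_NAMES.items():
        coded[name] = 1 if record["PrimaryDx"] == level else 0
    return coded


def model_predictors(reference_level):
    """List the indicators to include: every level except the reference level."""
    return [name for level, name in INDICATOR_NAMES.items() if level != reference_level]


coded_records = [add_indicators(r) for r in records]
predictors = model_predictors("hypertension")

# If all four indicators were kept, each row would sum to 1, which duplicates
# the intercept column. That redundancy is why the reference level is dropped.
all_four_sum = [
    sum(r[name] for name in INDICATOR_NAMES.values()) for r in coded_records
]

print(predictors)
print(all_four_sum)
```

With hypertension as the reference level, only Diab, Cancer, and OtherDx are passed to the regression, and a participant with hypertension is identified by having 0 on all three.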

Creating scatter charts before you jump into multiple regression analysis

One common mistake researchers make is immediately running a regression or another advanced

statistical analysis before thoroughly examining their data. As soon as your data are available in

electronic format, you should run error-checks, and generate summaries and histograms for each

variable you plan to use in your regression. You need to assess the way the values of the variables are

distributed as we describe in Chapter 11. And if you plan to analyze your data using multiple

regression, you need special preparation. Namely, you should chart the relationship between each

predictor variable and the outcome variable, and also the relationships between the predictor

variables themselves.
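
As a rough sketch of these checks, the Python below runs a simple range check, prints a summary for each variable, and computes a correlation for each pair of variables you would also chart. Everything in it is an assumption made for illustration: the values are made up (they are not the Table 17-2 data), the plausibility limits are arbitrary, and the function names are hypothetical. A correlation coefficient is a numeric companion to a scatter chart, not a replacement for looking at one.

```python
from statistics import mean, median

# Hypothetical values for illustration only; these are NOT the Table 17-2 data.
data = {
    "Age": [34, 45, 52, 61, 29, 48],
    "Weight": [62.0, 80.5, 77.0, 90.0, 58.5, 84.0],
    "SBP": [112, 128, 131, 150, 108, 135],
}

# Assumed plausibility limits for error-checking; set these for your own study.
PLAUSIBLE_RANGE = {"Age": (18, 100), "Weight": (30, 250), "SBP": (70, 250)}


def out_of_range(name):
    """Return the values of a variable that fall outside its plausible range."""
    low, high = PLAUSIBLE_RANGE[name]
    return [v for v in data[name] if not low <= v <= high]


def summarize(values):
    """Return basic descriptive statistics for one variable."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Step 1: error-check and summarize each variable.
for name, values in data.items():
    print(name, summarize(values), "suspect values:", out_of_range(name))

# Step 2: each predictor against the outcome, then the predictors against each other.
pairs = [("Age", "SBP"), ("Weight", "SBP"), ("Age", "Weight")]
correlations = {pair: pearson_r(data[pair[0]], data[pair[1]]) for pair in pairs}
for (x_name, y_name), r in correlations.items():
    print(f"r({x_name}, {y_name}) = {r:.2f}")
```

The last pair, Age against Weight, is the predictor-to-predictor relationship; a strong association there is worth knowing about before both variables go into the same model.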

Imagine that you are interested in whether the outcome of systolic blood pressure (SBP) can be

predicted by age, body weight, or both. Table 17-2 shows a small data file, used throughout the rest of

this chapter, with variables that could address this research question. It contains the age,

weight, and SBP of 16 study participants from a clinical population.

TABLE 17-2 Sample Age, Weight, and Systolic Blood Pressure Data for a Multiple Regression Analysis

Participant ID Age (years) Weight (kg) SBP (mmHg)

117

120

145

129

132

130

110

163

136